Machine Learning with Python and Scikit Learn

Follow the presentation at: http://bit.ly/ML-Notebook

All the code is available at: http://bit.ly/ML-SpringCampus

















What is Machine Learning?

The process of teaching computers to learn from data.

Learning tasks:

  • Clustering

  • Regression

  • Outlier Detection

  • Classification

  • Time series prediction

  • ....













Supervised vs Unsupervised Learning





















Let's look at some code:

We will use the following libraries:


In [1]:
import warnings

import numpy as np
import pandas as pd

from time import time

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

import Utils
from Utils import cmap_light
from Utils import cmap_bold


Import done









Boston House Prices


In [2]:
boston_dataset = datasets.load_boston()
print(boston_dataset.DESCR)


Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
















The data

Dataset:

A set of examples, each characterized by a number of features; usually there is also a label variable that you want to predict.
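As a minimal sketch of that idea (using made-up numbers, not the Boston data that follows), scikit-learn represents the examples as a 2-D array X of shape (n_samples, n_features) and the labels as a 1-D array y with one entry per example:

import numpy as np

# Three made-up examples, each described by two features
X = np.array([[5.1, 3.5],
              [6.2, 2.9],
              [4.8, 3.0]])   # shape (n_samples, n_features)

# One label per example -- the variable we want to predict
y = np.array([0, 1, 0])      # shape (n_samples,)

print(X.shape, y.shape)      # (3, 2) (3,)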
















Let's see a real example:


In [3]:
X = boston_dataset.data
Y = boston_dataset.target

names = list(boston_dataset.feature_names) + ['Price']

labels = np.reshape(Y,
                     (Y.shape[0], 1))
df = pd.DataFrame(data=np.concatenate((X, labels), axis=1),
                 columns=names)
df.head(10)


Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT Price
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394.12 5.21 28.7
6 0.08829 12.5 7.87 0.0 0.524 6.012 66.6 5.5605 5.0 311.0 15.2 395.60 12.43 22.9
7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0 15.2 396.90 19.15 27.1
8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0 15.2 386.63 29.93 16.5
9 0.17004 12.5 7.87 0.0 0.524 6.004 85.9 6.5921 5.0 311.0 15.2 386.71 17.10 18.9

In [4]:
df_tmp = df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX',
             'RM', 'Price']]
df_tmp.head(10)


Out[4]:
CRIM ZN INDUS CHAS NOX RM Price
0 0.00632 18.0 2.31 0.0 0.538 6.575 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 36.2
5 0.02985 0.0 2.18 0.0 0.458 6.430 28.7
6 0.08829 12.5 7.87 0.0 0.524 6.012 22.9
7 0.14455 12.5 7.87 0.0 0.524 6.172 27.1
8 0.21124 12.5 7.87 0.0 0.524 5.631 16.5
9 0.17004 12.5 7.87 0.0 0.524 6.004 18.9











In [5]:
df_tmp.describe()


Out[5]:
CRIM ZN INDUS CHAS NOX RM Price
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.593761 11.363636 11.136779 0.069170 0.554695 6.284634 22.532806
std 8.596783 23.322453 6.860353 0.253994 0.115878 0.702617 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 21.200000
75% 3.647423 12.500000 18.100000 0.000000 0.624000 6.623500 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 50.000000











In [6]:
from Utils import plot_boston_dataset
plot_boston_dataset(boston_dataset.data, 
                    boston_dataset.target)




















First model: Linear Regression

(Caption from xkcd's "Linear Regression" comic: "The 95% confidence interval suggests Rexthor's dog could also be a cat, or possibly a teapot.")





In [7]:
model = LinearRegression()

model.fit(X, Y)

r2 = model.score(X, Y)

print("R^2 value: {:0.3f}".format(r2))


R^2 value: 0.741










Congratulations!











In [8]:
example_n = np.random.randint(0, Y.shape[0])

Utils.describe_example_boston_dataset(X[example_n])

print("\n\nPredicted price: {:2.2f} Real value: {:2.2f}".format(
        model.predict(X[example_n].reshape(1, -1))[0], Y[example_n]))


Feature: CRIM     -    15.29
Feature: ZN       -     0.00
Feature: INDUS    -    18.10
Feature: CHAS     -     0.00
Feature: NOX      -     0.67
Feature: RM       -     6.65
Feature: AGE      -    93.30
Feature: DIS      -     1.34
Feature: RAD      -    24.00
Feature: TAX      -   666.00
Feature: PTRATIO  -    20.20
Feature: B        -   363.02
Feature: LSTAT    -    23.24


Predicted price: 17.73 Real value: 13.90










Distinguishing Species of Iris plants:



Source: Big Cypress National Preserve











In [9]:
iris_dataset = datasets.load_iris()

print("Features: " + str(iris_dataset.feature_names))
print("Classes: " + str(iris_dataset.target_names))

X = iris_dataset.data
y = iris_dataset.target


Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Classes: ['setosa' 'versicolor' 'virginica']

In [10]:
# Shuffle the rows and load the data into a DataFrame
idx = np.random.permutation(150)
y = y[idx]
X = X[idx]

labels = np.reshape(y,
                    (y.shape[0], 1))
df = pd.DataFrame(data=np.concatenate((X, labels), axis=1),
                 columns=iris_dataset.feature_names + ['Class'])
df.head(10)


Out[10]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Class
0 6.4 3.1 5.5 1.8 2.0
1 5.7 4.4 1.5 0.4 0.0
2 4.8 3.0 1.4 0.1 0.0
3 5.6 2.5 3.9 1.1 1.0
4 5.2 3.4 1.4 0.2 0.0
5 6.1 3.0 4.9 1.8 2.0
6 6.5 3.0 5.2 2.0 2.0
7 5.7 2.6 3.5 1.0 1.0
8 5.7 3.0 4.2 1.2 1.0
9 7.4 2.8 6.1 1.9 2.0

In [11]:
df.describe()


Out[11]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Class
count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667 1.000000
std 0.828066 0.433594 1.764420 0.763161 0.819232
min 4.300000 2.000000 1.000000 0.100000 0.000000
25% 5.100000 2.800000 1.600000 0.300000 0.000000
50% 5.800000 3.000000 4.350000 1.300000 1.000000
75% 6.400000 3.300000 5.100000 1.800000 2.000000
max 7.900000 4.400000 6.900000 2.500000 2.000000

In [12]:
# Let's take a peek at the data:
plt.figure(figsize=(8,8))
colors = "bry"
for i, color in zip([0, 1, 2], colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1], c=color, cmap=plt.cm.Paired)
    
plt.text(5.25, 2.20, "Versicolor", fontsize=14)
plt.text(7, 3.5, "Virginica", fontsize=14)
plt.text(4.5, 3.75, "Setosa", fontsize=14)

plt.title("The 3 different Iris species", fontsize=18, 
          fontweight='bold')    
plt.xlabel(iris_dataset.feature_names[0], fontsize=14)
plt.ylabel(iris_dataset.feature_names[1], fontsize=14)

plt.show()



In [13]:
# We will focus on identifying only Iris Setosa
plt.figure(figsize=(8,8))
colors = "br"

idx = np.where(y == 0) # Give me the indices of the Iris Setosa examples

plt.scatter(X[idx, 0], X[idx, 1], c='b', cmap=plt.cm.Paired)
plt.text(4.5, 3.75, "Setosa", fontsize=14)

idx = np.where(y != 0) # where it's not Iris Setosa 
plt.scatter(X[idx, 0], X[idx, 1], c='r', cmap=plt.cm.Paired)
plt.text(7.0, 2.5, "Others", fontsize=14)


plt.title("Scatter plot of Iris Setosa and the others Iris",
          fontsize=18, fontweight='bold')  
plt.xlabel(iris_dataset.feature_names[0], fontsize=14)
plt.ylabel(iris_dataset.feature_names[1], fontsize=14)
plt.show()











Second model: Logistic Regression




In [14]:
# We only care about whether each flower is an
#     Iris Setosa, and we look at only two of its features

X = iris_dataset.data
y = iris_dataset.target

new_y = y == 0

model = LogisticRegression(random_state=42, verbose=0)

model.fit(X[:,0:2], new_y)

accuracy = model.score(X[:,0:2], new_y)

print("Accuracy: {:0.3f}%".format(accuracy*100))


Accuracy: 99.333%

In [15]:
from Utils import predict_mesh

# Let's take a look at what our model is doing

# First plot the examples
plt.figure(figsize=(8,8))
colors = "br"

idx = np.where(y == 0)
plt.scatter(X[idx, 0], X[idx, 1], c='b', cmap=plt.cm.Paired)
plt.text(4.5, 3.75, "Setosa", fontsize=14)

idx = np.where(y != 0)
plt.scatter(X[idx, 0], X[idx, 1], c='r', cmap=plt.cm.Paired)
plt.text(7.0, 2.5, "Others", fontsize=14)

(xx, yy, Z) = predict_mesh(X, model)
plt.contour(xx, yy, Z, cmap=plt.cm.Paired)


plt.title("Decision Boundary", fontsize=18, fontweight='bold')   
plt.xlabel(iris_dataset.feature_names[0], fontsize=14)
plt.ylabel(iris_dataset.feature_names[1], fontsize=14)
plt.show()











Linear Regression and Logistic Regression

So how do these models work?

Let's start with linear regression:

$$ \hat{y} = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 $$

Adding a constant feature $x_0 = 1$, we get

$$ \hat{y} = w^T \cdot x $$

For each variable we have a weight, an "importance", and the linear combination of the weights and features results in our estimated value $\hat{y}$.
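As a quick sanity check of that formula (a small sketch on synthetic data, not on the Boston set), a fitted LinearRegression exposes the weights as model.coef_ and the bias $w_0$ as model.intercept_, and recombining them by hand reproduces model.predict:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                        # 100 examples, 3 features
y = 4.0 + X @ np.array([1.5, -2.0, 0.7])    # known bias and weights

model = LinearRegression()
model.fit(X, y)

x_new = np.array([0.2, 0.5, 0.9])
manual = model.intercept_ + np.dot(model.coef_, x_new)              # w0 + w^T x
print(np.isclose(manual, model.predict(x_new.reshape(1, -1))[0]))   # True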













What about the weights?















Questions?















Logistic Regression

Same model + Classifier function

Model:

$$ \hat{y} = w^T \cdot x $$



Model + classification:

$$ \hat{y} = g(w^T \cdot x) $$
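To see that this really is the same linear model pushed through a function $g$ (here the sigmoid introduced in the next section), here is a hedged sketch on synthetic data: in scikit-learn, decision_function returns the linear part $w^T \cdot x$, and applying the sigmoid to it reproduces predict_proba for the positive class.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=42)
clf = LogisticRegression(random_state=42)
clf.fit(X, y)

z = clf.decision_function(X)            # the linear part: w^T x (plus bias)
p_manual = 1.0 / (1.0 + np.exp(-z))     # g(z), the sigmoid
p_sklearn = clf.predict_proba(X)[:, 1]  # probability of the positive class

print(np.allclose(p_manual, p_sklearn))  # True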














Sigmoid function

$$g(z) = \frac{1}{1 + e^{-z}}$$






In [16]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.arange(-20, 20, 0.001)
y = sigmoid(x)

plt.figure(figsize=(10,5))
plt.plot(x, y)
plt.title("Sigmoid Function", fontsize=14)
plt.show()











Now on to a real-world example!










First look at the data

Data from: Whisky Classified


In [17]:
# Read the data file and drop the columns we don't care about:
whisky_dataframe = pd.read_csv(
    filepath_or_buffer="whiskies.csv", header=0, sep=',',
    index_col=1)
whisky_dataframe.drop(['RowID', 'Postcode', ' Latitude',
                       ' Longitude'], inplace=True, axis=1)
# We get 12 features

whisky_dataframe.head(10)


Out[17]:
Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey Nutty Malty Fruity Floral
Distillery
Aberfeldy 2 2 2 0 0 2 1 2 2 2 2 2
Aberlour 3 3 1 0 0 4 3 2 2 3 3 2
AnCnoc 1 3 2 0 0 2 0 0 2 2 3 2
Ardbeg 4 1 4 4 0 0 2 0 1 2 1 0
Ardmore 2 2 2 0 0 1 1 1 2 3 1 1
ArranIsleOf 2 3 1 1 0 1 1 1 0 1 1 2
Auchentoshan 0 2 0 0 0 1 1 0 2 2 3 3
Auchroisk 2 3 1 0 0 2 1 2 2 2 2 1
Aultmore 2 2 1 0 0 1 0 0 2 2 2 2
Balblair 2 3 2 1 0 0 2 0 2 1 2 1











In [18]:
whisky_dataframe.describe()


Out[18]:
Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey Nutty Malty Fruity Floral
count 86.000000 86.000000 86.000000 86.000000 86.000000 86.000000 86.000000 86.000000 86.000000 86.000000 86.000000 86.000000
mean 2.069767 2.290698 1.534884 0.546512 0.116279 1.244186 1.383721 0.976744 1.465116 1.802326 1.802326 1.697674
std 0.930410 0.717287 0.863613 0.990032 0.322439 0.853175 0.784686 0.932760 0.821730 0.629094 0.779438 0.855017
min 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2.000000 2.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000
50% 2.000000 2.000000 1.000000 0.000000 0.000000 1.000000 1.000000 1.000000 2.000000 2.000000 2.000000 2.000000
75% 2.000000 3.000000 2.000000 1.000000 0.000000 2.000000 2.000000 1.000000 2.000000 2.000000 2.000000 2.000000
max 4.000000 4.000000 4.000000 4.000000 1.000000 4.000000 3.000000 4.000000 4.000000 3.000000 3.000000 4.000000











In [19]:
Utils.plot_whisky_histograms(whisky_dataframe)



In [20]:
Utils.plot_whiky_body_correlation(whisky_dataframe)











Let's take a detour




















The Curse of Dimensionality

More features != Better Results

When the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
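One small illustrative sketch of that sparsity (not part of the notebook's Utils helpers): draw the same 30 random points in higher and higher dimensions and the distance to each point's nearest neighbour keeps growing, so every point ends up "alone" in the space.

import numpy as np

rng = np.random.RandomState(42)
n_points = 30

for dim in (1, 2, 10, 100):
    data = rng.rand(n_points, dim)               # points in the unit hypercube
    diffs = data[:, None, :] - data[None, :, :]  # pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise Euclidean distances
    np.fill_diagonal(dists, np.inf)              # ignore each point's distance to itself
    print("dim {:>3}: mean nearest-neighbour distance = {:.2f}".format(
        dim, dists.min(axis=1).mean()))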




















In [21]:
Utils.plot_1d_random_data(0.5, 30)



In [22]:
Utils.plot_2d_random_data(0.5, 30)











Feature selection and extraction



















PCA


In [23]:
n = 50

random_data_1 =np.random.multivariate_normal(
    mean= [0, 0], cov=[[3, 3], [0, 0.5]], size=n)

random_data_2 =np.random.multivariate_normal(
    mean= [6, 6], cov=[[3, 3], [0, 0.5]], size=n)

random_data = np.concatenate([random_data_1, random_data_2], axis=0)
random_labels = np.concatenate([np.ones((n,1)),np.zeros((n,1))], axis=0)

fig = plt.figure(figsize=(8, 8))

plt.scatter(random_data[:, 0], random_data[:, 1],
            c=random_labels, cmap=cmap_light)
#plt.scatter(random_data_2[:, 0], random_data_2[:, 1], c='r')

plt.plot([-5, 10], [-5, 10], 'r--')
plt.plot([5, 0], [0, 5], 'g--')

plt.xlim((-7, 14))
plt.ylim((-7, 14))
plt.title('Random Data with Principal Components', fontsize=16)

plt.xlabel('Random Dimension 1', fontsize=14)
plt.ylabel('Random Dimension 2', fontsize=14)

plt.show()



In [24]:
pca = PCA(n_components=2)
pca.fit(random_data)
transformed_data = pca.fit_transform(random_data)

plt.figure(figsize=(8,6))
plt.scatter(transformed_data[:,0], transformed_data[:,1],
            c=random_labels, cmap=cmap_light)
plt.plot([-10, 10], [0, 0], 'r--')
plt.xlim((-10, 10))
plt.ylim((-5, 5))
plt.title('Transformed Random Data', fontsize=16)
plt.xlabel('Random Dimension 1', fontsize=14)
plt.ylabel('Random Dimension 2', fontsize=14)

plt.show()



In [25]:
pca = PCA(n_components=1)
pca.fit(random_data)
transformed_data = pca.fit_transform(random_data)

plt.figure(figsize=(8,5))
plt.scatter(transformed_data[:,0], np.zeros((2 * n ,1)),
            c=random_labels, cmap=cmap_light)
plt.plot([-10, 10], [0, 0], 'r--')
plt.xlim((-10, 10))
plt.ylim((-5, 5))
plt.title('Transformed Random Data', fontsize=16)
plt.xlabel('Random Dimension 1', fontsize=14)

plt.show()

print("% of variance explained by PCA: {:0.1f}% \
        ".format(
        pca.explained_variance_ratio_[0]*100))


% of variance explained by PCA: 98.5%         


















Model complexity and overfitting


In [26]:
### Adapted from: 
# http://scikit-learn.org/stable/auto_examples/
# linear_model/plot_polynomial_interpolation.html 
# Author: Mathieu Blondel
#         Jake Vanderplas
# License: BSD 3 clause

def f(x, noise=False):
    """Sinusoidal function to be approximated."""
    if noise:
        # Use a seeded RandomState so the noise is reproducible
        rng = np.random.RandomState(42)
        return np.sin(x) + rng.randn(x.shape[0]) / 4
    return np.sin(x)

space_size = 2*np.pi

# generate points used to plot
x_plot = np.linspace(-space_size, space_size, 100)

# generate points and keep a subset of them
x = np.linspace(-space_size, space_size, 100)
rng = np.random.RandomState(42)
rng.shuffle(x)
x = np.sort(x[:10])
y = f(x, True)

# create matrix versions of these arrays
X = x[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]

colors = ['teal', 'green', 'brown', 'blue']
lw = 2

fig = plt.figure(figsize=(12,12))
    

for count, degree in enumerate([1, 3, 5, 10]):
    ax = fig.add_subplot(2, 2, count+1)
    ax.plot(x_plot, f(x_plot), color='cornflowerblue', linewidth=lw,
         label="ground truth")
    ax.scatter(x, y, color='navy', s=30, marker='o',
               label="training points")
    model = make_pipeline(PolynomialFeatures(degree), Ridge(random_state=42))
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    ax.plot(x_plot, y_plot, color=colors[count], linewidth=lw,
             label="degree {0} with r^2 = {1:.2f}".format(
            degree, model.score(X,y)))

    ax.legend(loc='lower left', fontsize=14)
    ax.set_ylim((-5, 5))
plt.show()




















Back to Scotch!

Let's apply what we learned to our dataset


In [27]:
whisky_data = whisky_dataframe.values

pca = PCA(n_components=2, whiten=True) 

# whiten=True rescales each principal component to unit variance
# (PCA itself already centers the data before projecting)
transformed_data = pca.fit_transform(whisky_data)

In [28]:
print("% of variance explained by each component: \
       \n 1st {:0.1f}% \
       \n 2nd {:0.1f}% \
        ".format(
        pca.explained_variance_ratio_[0]*100, 
        pca.explained_variance_ratio_[1]*100))


% of variance explained by each component:        
 1st 30.1%        
 2nd 19.2%         

In [29]:
fig = plt.figure(figsize=(8,6))
plt.scatter(x = transformed_data[:,0], y=transformed_data[:,1])

plt.xlim((-3, 5))
plt.ylim((-3, 5))

plt.title('Transformed Whisky Data', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=14)
plt.ylabel('Principal Component 2', fontsize=14)

plt.show()




















Predicting whether a whisky has a Tobacco flavour


In [30]:
labels = whisky_dataframe['Tobacco']
whisky_data = whisky_dataframe.drop('Tobacco', axis=1).values

In [31]:
print("Percentage of Positive Labels: {:.2f}%".format(
        np.sum(labels)/len(labels)*100))


Percentage of Positive Labels: 11.63%













Unbalanced dataset


In [32]:
pca = PCA(n_components=2, whiten=True) 
# whiten=True rescales each principal component to unit variance
# (PCA itself already centers the data before projecting)
transformed_data = pca.fit_transform(whisky_data)

In [33]:
train_data, test_data, train_labels, test_labels = train_test_split(
    transformed_data, labels, test_size=0.30, random_state=0)

# Without Class weights
classf = LogisticRegression()

# With Class weights
class_weight={0:1, 1: 12}
classf = LogisticRegression(class_weight=class_weight)

classf.fit(train_data, train_labels)

accuracy = classf.score(train_data, train_labels)

print("\n\nTraining Accuracy:\t {:0.3f}%\n\n".format(accuracy*100))

accuracy = classf.score(test_data, test_labels)

print("Test Accuracy:\t\t {:0.3f}%\n\n".format(accuracy*100))



Training Accuracy:	 76.667%


Test Accuracy:		 73.077%


Confusion Matrix


In [34]:
print("\tTraining \n")
predicted_labels = classf.predict(train_data)
cm = confusion_matrix(train_labels, predicted_labels)
Utils.print_cm(cm)

print("\n\tTesting \n")

predicted_labels = classf.predict(test_data)
cm = confusion_matrix(test_labels, predicted_labels)
Utils.print_cm(cm)


	Training 

Real Values  |  Predicted Values     
             |  No Tobacco | Tobacco    
No Tobacco   |     42      |    12 
Tobacco      |      2      |     4 

	Testing 

Real Values  |  Predicted Values     
             |  No Tobacco | Tobacco    
No Tobacco   |     17      |     5 
Tobacco      |      2      |     2 















Cross Validation




In [35]:
class_weight={0:1, 1: 12}

classf = LogisticRegression(random_state=42, 
                            class_weight=class_weight)
#classf = LogisticRegression(random_state=42)

In [36]:
# Select parameters to use in Cross-Validation
classf_cv = classf
data_cv = transformed_data
N_CV = 10

# Cross Validation
t0 = time()
scores = cross_val_score(classf_cv, data_cv, labels, cv = N_CV)
print("Scores: ")
for i, score in enumerate(scores):
    print( '\t' + str(i) + ':\t' + str(score)) 
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
print("\nCross val done in %0.3fs." % (time() - t0))


Scores: 
	0:	0.666666666667
	1:	0.666666666667
	2:	0.555555555556
	3:	0.666666666667
	4:	0.777777777778
	5:	0.555555555556
	6:	0.5
	7:	0.75
	8:	0.75
	9:	0.5
Accuracy: 0.639 (+/- 0.199)

Cross val done in 0.033s.